Graph theoretic analysis of protein interaction networks of eukaryotes 






K.-I. Goh*, B. Kahng*† and D. Kim*
*School of Physics and †Program in Bioinformatics,
Seoul National University, Seoul 151-747, Korea 
(Dated: June 15, 2004) 

Thanks to recent progress in high-throughput experimental techniques, large-scale protein
interaction datasets have been assayed for two prototypical multicellular species, the nematode
worm Caenorhabditis elegans and the fruit fly Drosophila melanogaster. The datasets are obtained
mainly by the yeast two-hybrid method, which produces both false positives and false negatives.
While it is desirable to test such datasets through further wet experiments, here we invoke
recently developed network theory to test such high-throughput datasets in a simple way. Our test
rests on the fact that the key biological processes indispensable to maintaining life are
universal across eukaryotic species, and on a comparison of the structural properties of the
protein interaction networks (PINs) of the two species with those of the yeast PIN. We find that
while the worm and the yeast PIN datasets exhibit similar structural properties, the current fly
dataset, though the most comprehensively screened to date, does not yet reflect the generic
structural properties correctly: the modularity is suppressed and the connectivity correlation is
lacking. Addition of interlogs to the current fly dataset increases the modularity and enhances
the occurrence of triangular motifs as well. The connectivity correlation function of the fly,
however, remains distinct under such interlog addition, for which we present a possible scenario
through in silico modeling.

Introduction

In the last few years, graph theoretic methods for understanding complex biomolecular systems
have developed very rapidly [4]. This development has advanced the uncovering of the organizing
principles of cellular networks in post-genomic biology. The cellular components such as genes,
proteins, and other biological molecules, connected by all physiologically relevant interactions,
form a full weblike molecular architecture in a cell. In such an architecture, genes, which are
expressed through proteins, play a central role. Proteins rarely act alone; rather, they
cooperate with others to act physiologically. Thus protein interactions play pivotal roles in
various aspects of the structural and functional organization of the cell, and their complete
description would be the first step toward a thorough understanding of the web of life. Proteins
are viewed as nodes of a complex protein interaction network (PIN) in which two proteins are
linked if they physically contact each other. The graph theoretic approach has been useful for
understanding the intricately interwoven structure of the PIN [?]. The key biological processes
indispensable to maintaining life are universal across eukaryotic species, since many of the
genes involved are evolutionarily conserved [?]. Using this property, one can test whether a
newly obtained dataset contains more or less complete information on protein interactions.
Moreover, this in silico approach offers candidate protein interaction pairs, the number of which
is considerably smaller than the total number of combinatorial pairs. Thus, the graph theoretic
analysis would provide a useful guide for further wet studies of protein interactions.

Species with sequenced genomes, such as the yeast Saccharomyces cerevisiae, provide important
test beds for the study of the PIN. Thanks to recent progress in high-throughput experimental
techniques such as the yeast two-hybrid assay [?] and mass spectrometry [?], the dataset of the
yeast PIN has been firmly established [?]. Very recently, large-scale protein interactions of
multicellular species, the nematode worm Caenorhabditis elegans [10] and the fruit fly Drosophila
melanogaster [11], have been assayed. While those datasets, mainly based on the yeast two-hybrid
assay, need physiological proof, they contain large numbers of proteins and protein
interactions, making a graph theoretic study possible. In this paper, we analyze those datasets
and compare them with the more established set of interactions in the budding yeast [17]. Our
graph theoretic analysis suggests that the present interaction dataset of the fruit fly, based
on the yeast two-hybrid (Y2H) assay, may have left out a significant part of the protein
interactions, even though it is the most comprehensively screened to date. This conclusion has
been reached by comparing the generic features of the PIN, the modularity and the connectivity
correlations, across the three species. For the fly, those quantities behave distinctively: the
modularity is suppressed and the connectivity correlation is lacking. Such distinct behavior can
be overcome partially by the addition of yeast interlogs to the fly dataset.



Materials and Methods 

Fig. 1: The degree distributions $P_d(k)$ for (a) the yeast, (b) the prokaryotes Helicobacter
pylori (○) and Escherichia coli (□), (c) the worm (Worm-All), (d) the Y2H subset of the worm
dataset (Worm-Y2H), (e) the fly, and (f) the Fly+Interlog dataset.

Graph theory terminology. (i) A network is composed of vertices and edges. In the protein
interaction network, vertices represent proteins and edges represent protein interactions.
(ii) The degree is the number of edges connected to a given vertex. The degree distribution
$P_d(k)$ is the fraction of vertices having degree $k$. (iii) The clustering coefficient of a
vertex $i$ is defined as $C_i = 2e_i/[k_i(k_i-1)]$, where $e_i$ is the number of connections
among the $k_i$ neighbors of vertex $i$. The clustering function $C(k)$ is the mean value of
$C_i$ over the vertices with degree $k$, while the clustering coefficient $C$ is the mean of
$C_i$ over all vertices. When the network contains hierarchical and modular structures within
it, it is known that the clustering function behaves as $C(k) \sim k^{-\beta}$ for large $k$.
(iv) $\langle k_{nn}\rangle(k)$ is the mean degree of the neighbors of a vertex with degree $k$.
It is known that $\langle k_{nn}\rangle(k) \sim k^{-\nu}$ with $\nu > 0$ for the Internet and the
protein interaction network [?], implying that vertices with large degree tend to connect to
the ones with small degree. Such a network is called a disassortative network. Besides this
quantity, the assortativity $r$ has been introduced [?] to characterize the degree-degree
correlation between the two vertices located at the ends of an edge, which is defined as

$$ r = \frac{\langle k_1 k_2\rangle - \langle (k_1+k_2)/2\rangle^2}{\langle (k_1^2+k_2^2)/2\rangle - \langle (k_1+k_2)/2\rangle^2}, $$

where $k_1$ and $k_2$ are the degrees of the two vertices at the ends of an edge, and
$\langle\cdots\rangle$ denotes the average over all edges.
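The quantities (i)-(iv) and the assortativity $r$ defined above can be computed directly from an edge list. The following is a minimal sketch in plain Python, not the authors' code; the function names are hypothetical, and setting $C_i = 0$ for vertices with fewer than two neighbors is an assumption, since the text does not state a convention for that case.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected simple graph as {vertex: set(neighbors)}; self-interactions are dropped."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)

def local_clustering(adj, i):
    """C_i = 2 e_i / (k_i (k_i - 1)); taken as 0 when k_i < 2 (assumed convention)."""
    nbrs = list(adj[i])
    k = len(nbrs)
    if k < 2:
        return 0.0
    e = sum(1 for a in range(k) for b in range(a + 1, k) if nbrs[b] in adj[nbrs[a]])
    return 2.0 * e / (k * (k - 1))

def clustering_function(adj):
    """C(k): mean of C_i over the vertices with degree k."""
    by_k = defaultdict(list)
    for i in adj:
        by_k[len(adj[i])].append(local_clustering(adj, i))
    return {k: sum(v) / len(v) for k, v in by_k.items()}

def clustering_coefficient(adj):
    """C: mean of C_i over all vertices."""
    return sum(local_clustering(adj, i) for i in adj) / len(adj)

def mean_neighbor_degree(adj):
    """<k_nn>(k): mean degree of the neighbors of vertices with degree k."""
    by_k = defaultdict(list)
    for i, nbrs in adj.items():
        by_k[len(nbrs)].append(sum(len(adj[j]) for j in nbrs) / len(nbrs))
    return {k: sum(v) / len(v) for k, v in by_k.items()}

def assortativity(adj):
    """r as defined in the text; averages run over all edges."""
    n = prod = half_sum = half_sq = 0.0
    # Each edge is visited in both orientations; all three averaged
    # quantities are symmetric in (k1, k2), so the averages are unchanged.
    for u, nbrs in adj.items():
        for v in nbrs:
            k1, k2 = len(adj[u]), len(adj[v])
            n += 1
            prod += k1 * k2
            half_sum += (k1 + k2) / 2
            half_sq += (k1 * k1 + k2 * k2) / 2
    mean_sq = (half_sum / n) ** 2
    denom = half_sq / n - mean_sq
    if denom == 0:
        return float("nan")  # all vertices have the same degree
    return (prod / n - mean_sq) / denom
```

For example, a star graph, in which one hub connects only to degree-1 vertices, gives $r = -1$, the extreme disassortative case.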

The protein interaction network datasets. We used the yeast subset of the interaction data
compiled in the Database of Interacting Proteins (DIP) as of January 2004
(http://dip.doe-mbi.ucla.edu) [17]. The datasets for the worm and the fly are obtained from the
works of Li et al. [10] and Giot et al. [11], respectively. For the worm, we consider two
different versions: one consisting of only the interactions from the Y2H screens (referred to as
the Worm-Y2H network in this paper) and the other being the full network supplied by Li et al.
[10] (referred to as the Worm-All network). The characteristics of each dataset and the values
of the graph theoretic quantities are tabulated in Table 1.

Orthologous gene assignment. For cross-species ortholog information, we used the KOG database
[?], a eukaryotic extension of the Clusters of Orthologous Genes (COG) database
(http://www.ncbi.nlm.nih.gov/COG/new/).

Table 1: Protein interaction network datasets. Tabulated for each dataset are the size of the
proteome $N_{proteome}$, the number of proteins $N$ and the number of protein-protein
interactions $L$ in the dataset, the mean degree $\langle k\rangle$, the clustering coefficient
$C$, the assortativity $r$, and the number of proteins forming the largest cluster $N_1$.
Self-interactions are excluded throughout.

                       Yeast     Worm-Y2H   Worm-All   Fly
  $N_{proteome}$       6195      22246      22246      16206
  $N$                  4714      2835       3216       7055
  $L$                  14857     4438       50444      20947
  $\langle k\rangle$   6.3       3.1        3.4        5.9
  $C$                  0.12      0.047      0.15       0.014
  $r$                  -0.14     -0.16      -0.13      -0.036
  $N_1$                4627      2601       2898       6929

Fig. 2: The local clustering function $C(k)$ for (a) the yeast, (b) the bacteria H. pylori (○)
and E. coli (□), (c) the worm (Worm-All), (d) the Worm-Y2H dataset, (e) the fly, and (f) the
Fly+Interlog dataset. The abscissae and ordinates are fixed for clear comparison.

Yeast interlogs in fly. Having identified the yeast-fly orthologs, we look for the interactions
in the yeast network between yeast proteins that both have orthologs in the fly network. Such
orthologous interactions are called interlogs. If the corresponding fly interaction is present,
we call it an overlap interlog. If not, we call it a potential interlog. Note that the ortholog
relationship is not always one-to-one, resulting in multiple interlogs for a given yeast
interaction. For the in silico analysis of the effect of adding potential interlogs to the fly
network, we include on average one potential interlog per yeast interaction. Specifically, for
each yeast interaction A-B having no overlap interlog, each potential interlog is added to the
fly network with probability $1/(o_A o_B)$, where $o_X$ is the number of fly orthologs of the
yeast gene X. The network obtained in this way is referred to as the Fly+Interlog network
hereafter. The full list of the 408 overlap and the 55176 potential interlogs is available on
the web (http://komplexO.snu.ac.kr/pin/yeast-fly-interlog.xls).
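The interlog-addition rule described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation; the function name, the `orthologs` mapping (yeast gene to list of fly orthologs), and the choice to skip self-pairs are assumptions.

```python
import random

def add_potential_interlogs(yeast_edges, orthologs, fly_edges, rng):
    """Return the fly edge set augmented with randomly chosen potential interlogs.

    yeast_edges: iterable of (A, B) yeast interactions
    orthologs:   dict mapping a yeast gene to a list of its fly orthologs (assumed input)
    fly_edges:   iterable of (a, b) fly interactions
    rng:         random.Random instance
    """
    fly = {tuple(sorted(e)) for e in fly_edges}
    augmented = set(fly)
    for A, B in yeast_edges:
        o_A, o_B = orthologs.get(A, []), orthologs.get(B, [])
        if not o_A or not o_B:
            continue  # no interlog can be formed for this yeast interaction
        # All orthologous pairs (interlogs) of the yeast interaction A-B.
        pairs = sorted({tuple(sorted((a, b))) for a in o_A for b in o_B if a != b})
        if any(p in fly for p in pairs):
            continue  # an overlap interlog exists, so nothing is added
        p_add = 1.0 / (len(o_A) * len(o_B))
        for p in pairs:
            # Each potential interlog is added with probability 1/(o_A * o_B),
            # i.e. about one added interlog per yeast interaction on average.
            if rng.random() < p_add:
                augmented.add(p)
    return augmented
```

Since the addition is random, one would repeat it with different seeds and average the measured quantities over the realizations.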

Results 

Degree distributions. In Fig. 1 we plot the degree distributions of diverse protein interaction
networks. All of them display scale-free behavior, are fit well by the generalized Pareto form
$P_d(k) \sim (k + k_0)^{-\gamma}$, and are almost indistinguishable from each other. While the
degree distribution is a fundamental quantity in graph theory, it characterizes only the global
network structure and gives no detailed information on finer structural properties.
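As a small illustration of the quantity plotted in Fig. 1, the sketch below tabulates the empirical degree distribution from an edge list and evaluates the generalized Pareto shape. The function names are hypothetical, and no fitting procedure is implied, since the text does not say how $k_0$ and $\gamma$ were obtained.

```python
from collections import Counter

def degree_distribution(edges):
    """Empirical P_d(k): fraction of vertices with degree k.

    Self-interactions and duplicate edges are ignored; vertices without any
    edge do not appear in an edge list and are therefore not counted.
    """
    seen = set()
    degree = Counter()
    for u, v in edges:
        e = frozenset((u, v))
        if u == v or e in seen:
            continue
        seen.add(e)
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    hist = Counter(degree.values())
    return {k: c / n for k, c in sorted(hist.items())}

def generalized_pareto_shape(k, k0, gamma):
    """Unnormalized generalized Pareto form (k + k0)^(-gamma)."""
    return (k + k0) ** (-gamma)
```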

Modularity 

A cellular function is achieved by a set of related proteins, usually forming a pathway or a
complex. Such a functional module manifests itself as a localized dense subgraph within the
whole cellular network. The presence of modules and their hierarchical organization can be
visualized by the local clustering function $C(k)$. For the yeast PIN, $C(k)$ exhibits a plateau
for small $k$ and falls off rapidly for large $k$, reflecting the modular structure bridged by
the hubs (Fig. 2a). A similar pattern is observed in the worm (Fig. 2c) and the two prokaryotic
species, H. pylori and E. coli (Fig. 2b). Note that the worm dataset contains the yeast
interlogs. For the fly Y2H data, however, $C(k)$ behaves distinctively, being almost constant
for all $k$ (Fig. 2e). To understand this discrepancy, we add the potential yeast interlogs to
the current fly Y2H dataset. Then $C(k)$ behaves in a fashion similar to the other datasets,
showing a moderate plateau for small $k$ and a rapid decrease for large $k$, although the height
of the plateau, which is roughly the clustering coefficient $C$, is not as high as in the yeast
and the worm (Fig. 2f). To find the role of the interlogs in the worm, we consider the Worm-Y2H
dataset and plot its $C(k)$ in Fig. 2d. Indeed, the signature character of $C(k)$ is lost; in
particular, the plateau for small $k$ almost disappears, implying that the yeast interlogs play
a role in forming modules, in which proteins are closely linked to each other.

Fig. 3: Conservation of interaction motif. Shown in the middle is a triangular interaction
subgraph within the yeast network involved in ubiquitin-dependent protein catabolism. The
corresponding orthologous counterparts in the worm and the fly are also shown. This motif is
conserved in the worm Y2H data, while only a single interaction is detected in the fly data.

Conservation rate of interactions. We count how many yeast interactions are actually conserved
in orthologous form in the worm and in the fly. The conservation rate found in this way for the
Y2H screen datasets is surprisingly low: 2.7% for the worm (Worm-Y2H) and 3.8% for the fly. For
the worm, we note that such low coverage is in part due to the insufficient number of baits used
in the experiment (3,024 baits, 833 of which are present in the network). When we consider the
conservation of triangular interaction patterns, a basic unit of a cooperative functional
module, only 3 out of 1731 are conserved in the worm, and none in the fly (Fig. 3). The lack of
conserved interaction motifs in the fly data suggests that the current fly network misses some
important cooperative aspects of the cellular network in the fly. The effort to fill this gap is
timely.
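The counting described above might be sketched as follows. The text does not specify the denominator of the conservation rate, so this sketch assumes it is the set of yeast interactions whose two partners both have at least one ortholog in the target species; the function names and the `orthologs` mapping are likewise hypothetical.

```python
def _edge_set(edges):
    """Undirected edges as frozensets, with self-interactions dropped."""
    return {frozenset(e) for e in edges if e[0] != e[1]}

def conservation_rate(yeast_edges, orthologs, target_edges):
    """Fraction of testable yeast interactions that have an interlog in the target network."""
    target = _edge_set(target_edges)
    testable = [(A, B) for A, B in yeast_edges if orthologs.get(A) and orthologs.get(B)]
    if not testable:
        return 0.0
    conserved = sum(
        any(frozenset((a, b)) in target for a in orthologs[A] for b in orthologs[B])
        for A, B in testable
    )
    return conserved / len(testable)

def yeast_triangles(yeast_edges):
    """All triangles (as frozensets of three proteins) in the yeast network."""
    adj = {}
    for u, v in yeast_edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return {frozenset((u, v, w)) for u in adj for v in adj[u] for w in adj[u] & adj[v]}

def conserved_triangles(yeast_edges, orthologs, target_edges):
    """Return (number of conserved triangles, number of yeast triangles).

    A triangle counts as conserved if some choice of orthologs a, b, c of its
    three proteins forms a triangle in the target network.
    """
    target = _edge_set(target_edges)
    triangles = yeast_triangles(yeast_edges)
    conserved = 0
    for tri in triangles:
        A, B, C = tuple(tri)
        if any(
            frozenset((a, b)) in target
            and frozenset((b, c)) in target
            and frozenset((a, c)) in target
            for a in orthologs.get(A, ())
            for b in orthologs.get(B, ())
            for c in orthologs.get(C, ())
        ):
            conserved += 1
    return conserved, len(triangles)
```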

Motif structure. Since the modularity manifested by $C(k)$ is closely related to the formation
of triangles in the network, we further perform network motif analysis for the datasets of the
three species. Network motifs are small recurring subgraphs which are overrepresented in a given
network and are believed to provide the basic evolutionary and functional signatures of the
network [12]. Since it was recently discovered that the motif constituents are more conserved
during evolution than the rest [24], one would expect the density of each motif to be similar
across the three species. Comparing the columns for Yeast, Worm-All, and Fly in Table 2, we see
that the triangle motif is relatively scarce in Fly, while the square motif is abundant. Thus,
the absolute magnitude of the clustering function is smaller for the fly than for the yeast or
the worm. The density of the triangle motif is higher in the Fly+Interlog dataset, indicating
that the clustering coefficient is enhanced overall by the addition of the interlogs to the fly
dataset.

In Table 2 we have summarized the motif structure for each network. We follow Milo et al. [12]
to calculate the two scores, the Z- and S-score, defined as
$Z = (N - N_{random})/\sigma_{random}$ and $S = (N - N_{random})/N_{random}$, respectively, and
use the following two criteria to specify whether a subgraph is a motif or an anti-motif (an
anti-motif is a subgraph significantly underrepresented in the network):

(i) The probability that $N$ is observed in the randomized network is smaller than 0.01.

(ii) $|S| > S_0$, where we set the threshold $S_0 = 0.5$, rather than $S_0 = 0.1$ as in Milo et
al. [12].

Here, $N$ is the number of occurrences of the subgraph in the real network, and $N_{random}$ and
$\sigma_{random}$ are the expected number of occurrences in the randomized version of the
network and its standard deviation, obtained from 1000 samples, where the randomization is
performed by the switching method [?]. In calculating them for the 4-node subgraphs, the numbers
of 3-node subgraphs are fixed to be those of the original networks. For the Fly+Interlog
network, 10 realizations of interlog addition (see Materials and Methods) are averaged.
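To make the two scores concrete, here is a minimal sketch for the triangle subgraph only, using degree-preserving edge switching as the randomization. It is a simplified illustration under stated assumptions, not the authors' pipeline: the function names, the number of switches per sample, and the one-sided empirical probability are choices made here, and the constraint that fixes 3-node subgraph counts when scoring 4-node subgraphs is not implemented.

```python
import random
import statistics

def count_triangles(edges):
    """Number of triangles in a simple undirected graph given as unique edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Every triangle is found once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

def switch_randomize(edges, n_swaps, rng):
    """Degree-preserving randomization: rewire (a,b),(c,d) into (a,d),(c,b)."""
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    for _ in range(n_swaps):
        i, j = rng.randrange(len(edge_list)), rng.randrange(len(edge_list))
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # choose the orientation of the second edge at random
        if len({a, b, c, d}) < 4:
            continue  # would create a self-interaction
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue  # would create a duplicate edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list

def triangle_scores(edges, n_samples=100, swaps_per_edge=10, seed=0):
    """Z- and S-score of the triangle count against switched random networks."""
    rng = random.Random(seed)
    edges = list(edges)
    n_real = count_triangles(edges)
    counts = [
        count_triangles(switch_randomize(edges, swaps_per_edge * len(edges), rng))
        for _ in range(n_samples)
    ]
    mean = statistics.mean(counts)
    sd = statistics.pstdev(counts)
    z = (n_real - mean) / sd if sd > 0 else float("nan")
    s = (n_real - mean) / mean if mean > 0 else float("nan")
    # Empirical probability of seeing at least n_real triangles at random.
    p_over = sum(c >= n_real for c in counts) / n_samples
    return {"N": n_real, "N_random": mean, "Z": z, "S": s, "p_over": p_over}
```

Restarting each sample from the original edge list, as done here, keeps the samples independent at the cost of repeating the switching for every sample.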

Degree-degree correlation 

The mean neighbor degree function $\langle k_{nn}\rangle(k)$ is useful in understanding the
degree-degree correlation in a network. In Fig. 4 we plot $\langle k_{nn}\rangle(k)$ for each
dataset. For the yeast, it is known that $\langle k_{nn}\rangle(k)$ decreases with increasing
$k$ [?], which turns out to be true for some prokaryotic species as well (Figs. 4a-b). Such
behavior is also observed for the worm (Figs. 4c-d). For the fly, however,
$\langle k_{nn}\rangle(k)$ is flat, implying a lack of correlation (Fig. 4e). This distinct
behavior of the fly is robust under the addition of the interlogs (Fig. 4f), which suggests that
the lack of correlation in the fly network could be intrinsic, even though we cannot exclude the
possibility that it is again an artifact of the incompleteness of the data. The hypothesis that
the lack of correlation could be intrinsic may be supported by the following observations.

Effect of diversification of gene function on $\langle k_{nn}\rangle(k)$. While the pattern of
$C(k)$ of the fly becomes similar to those of the yeast and the worm upon the addition of the
interlogs, that of $\langle k_{nn}\rangle(k)$ remains distinct. Thus we investigate here, through
an in silico model, whether such flat behavior is intrinsic, and find that the decreasing
behavior of $\langle k_{nn}\rangle(k)$ is indeed moderated as the network evolves through
duplication and divergence processes. Homologs in a genome are thought to result from gene
duplication events, which are usually followed by diversification that lowers the redundancy.
Some computer models aiming to mimic these processes in proteome evolution exist in the
literature [18, ?]. We investigate how the diversification process affects the topological
properties of the proteome network, in particular the degree-degree correlation in terms of
$\langle k_{nn}\rangle(k)$. To this end, we perform the following procedure, motivated by
Vazquez et al. [22]:

1. Starting with the yeast protein network, at each step a protein A is chosen randomly and
duplicated as A'. The proteins A and A' then share common neighbors.

2. For each neighboring protein of A and A', one of its two edges, the one connected to A or
the one connected to A', is removed with equal probability.

3. Repeat steps 1-2 until the number of proteins reaches ~20,000, the approximate size of the
worm and the fly proteomes.
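Steps 1-3 above can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' code: the function name and the labels given to duplicated proteins are hypothetical, and "duplicate, then remove one of the two edges per neighbor" is implemented equivalently as handing each neighbor of A to A' with probability 1/2.

```python
import random

def diversify(adj, target_size, rng):
    """Duplication followed by complete divergence, repeated until target_size proteins.

    adj: {protein: set(neighbors)} for an undirected network (left unmodified)
    rng: random.Random instance
    """
    adj = {u: set(nbrs) for u, nbrs in adj.items()}
    nodes = list(adj)
    n_dup = 0
    while len(nodes) < target_size:
        a = rng.choice(nodes)            # step 1: choose a protein A at random
        dup = ("dup", n_dup)             # hypothetical label for the copy A'
        n_dup += 1
        adj[dup] = set()
        for nb in list(adj[a]):
            # Step 2: each neighbor keeps exactly one of its edges to A or A'.
            if rng.random() < 0.5:
                adj[a].discard(nb)
                adj[nb].discard(a)
                adj[dup].add(nb)
                adj[nb].add(dup)
        nodes.append(dup)                # step 3: repeat until the target size
    return adj
```

The total number of edges is unchanged by each step, so the mean degree falls as proteins are added.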

Note that in this procedure the number of proteins increases while the number of interactions
stays constant. Thus the average degree decreases as the size of the proteome increases. Such a
decrease would be compensated by, e.g., the acquisition of new interactions between existing
proteins via mutation. However, we do not take such a process into account, in order to single
out the effect of the diversification alone.

Table 2: Network motif structure of the three species. Tabulated is the number of each subgraph
present in the network. According to its Z- and S-score, the significant motifs (M) and
anti-motifs (AM) are indicated.

Fig. 4: The average neighbor degree function $\langle k_{nn}\rangle(k)$ for (a) the yeast,
(b) the prokaryotes H. pylori (○) and E. coli (□), (c) the worm (Worm-All), (d) the Y2H subset
of the worm (Worm-Y2H), (e) the fly, and (f) the Fly+Interlog dataset. The abscissae and
ordinates are fixed for clear comparison.

The result of the simulation is shown in Fig. 5. The local clustering function $C(k)$ is simply
shifted downward, due to the overall decrease of the edge density. On the other hand, the
average neighbor degree $\langle k_{nn}\rangle(k)$ still decreases with $k$ but at a smaller
rate, indicating that the diversification process can, although not perfectly, neutralize the
connectivity correlation. Furthermore, if we assume that the establishment of new interactions
follows preferential attachment or random attachment, the overall correlation would diminish
eventually.

Fig. 5: Effect of gene function diversification on (a) $C(k)$ and (b)
$\langle k_{nn}\rangle(k)$. Red circles are the data of the original yeast network and blue
squares those after running the diversification procedure in silico. The slopes of the straight
lines (the rates of decrease) in (b) are -0.3 (top, green) and -0.15 (bottom, magenta),
respectively.
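One simple way to estimate a rate of decrease such as those quoted in the caption is an ordinary least-squares fit of $\log\langle k_{nn}\rangle$ against $\log k$. The sketch below assumes this approach; the text does not say how the straight lines in Fig. 5 were obtained, and the function name is hypothetical.

```python
import math

def loglog_fit(points):
    """Least-squares line through (log k, log y); returns (slope, intercept).

    points: iterable of (k, y) pairs; pairs with k <= 0 or y <= 0 are skipped.
    For y ~ k^(-nu), the returned slope estimates -nu.
    """
    logs = [(math.log(k), math.log(y)) for k, y in points if k > 0 and y > 0]
    if len({x for x, _ in logs}) < 2:
        raise ValueError("need at least two distinct positive k values")
    n = len(logs)
    mean_x = sum(x for x, _ in logs) / n
    mean_y = sum(y for _, y in logs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in logs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in logs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

In practice one would restrict the fit to the range of $k$ over which the data look linear on the log-log plot, since sparse large-$k$ bins are noisy.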



Effect of bait selection on $\langle k_{nn}\rangle(k)$. There has been an argument that the
apparent decreasing trend in $\langle k_{nn}\rangle(k)$ is an artifact of the limited selection
of baits in the two-hybrid experiment [?]. Indeed, Li et al. [10] had selected the baits with
their own criteria, mainly based on biological indispensability and potential applicability to
human therapeutics. To check this hypothesis in silico, we sampled a 30% subset of the 4950
baits identified in Giot et al.'s fly network [11] and reconstructed the network with only the
interactions associated with the sampled baits. We sampled in two different ways: random
sampling, and sampling biased toward the highly connected baits (the sampling probability is
proportional to the number of bait interactions). Both sampled datasets show a decreasing trend
in $\langle k_{nn}\rangle(k)$ (Fig. 6). One can see that even though the original network has a
null slope in $\langle k_{nn}\rangle(k)$, a negative slope develops in the sampled ones,
demonstrating that an insufficient selection of baits can produce an artifactual correlation in
the connectivity. If this scenario holds, one may conjecture that the
$\langle k_{nn}\rangle(k)$ curve will become flatter as the interaction data accumulate and
become more complete.
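The bait-sampling experiment can be sketched as follows, assuming the data are available as (bait, prey) pairs. The function names are hypothetical, and the biased case uses weighted sampling without replacement via random keys $u^{1/w}$ (the Efraimidis-Spirakis scheme), which is one possible reading of "sampling probability proportional to the number of bait interactions" and not necessarily what the authors did.

```python
import random

def sample_baits(bait_prey_pairs, fraction, rng, biased=False):
    """Select a fraction of the baits, uniformly or biased toward well-connected baits."""
    n_interactions = {}
    for bait, _prey in bait_prey_pairs:
        n_interactions[bait] = n_interactions.get(bait, 0) + 1
    baits = sorted(n_interactions)
    n_keep = round(fraction * len(baits))
    if not biased:
        return set(rng.sample(baits, n_keep))
    # Weighted sampling without replacement: draw key u^(1/w) for each bait
    # with weight w = number of bait interactions, keep the largest keys.
    keys = {b: rng.random() ** (1.0 / n_interactions[b]) for b in baits}
    return set(sorted(baits, key=keys.get, reverse=True)[:n_keep])

def subnetwork(bait_prey_pairs, kept_baits):
    """Network rebuilt from only the interactions found with the kept baits."""
    return {
        frozenset((bait, prey))
        for bait, prey in bait_prey_pairs
        if bait in kept_baits and bait != prey
    }
```

The resulting edge set can then be passed to any routine that computes $\langle k_{nn}\rangle(k)$ to compare the full and sampled networks.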

Summary and discussion 

We have investigated in detail the structural properties of the protein interaction networks of
three eukaryotic species, the budding yeast, the nematode worm, and the fruit fly. In
particular, we have focused on the comparative assessment of the modularity and the
degree-degree correlation for those networks. We found that while the worm dataset behaves
similarly to the yeast for the two graph theoretic quantities, the fly does not. The difference
might be attributed to the presence (absence) of the yeast interlogs in the current worm (fly)
dataset. For the fly dataset, the modularity is suppressed and the connectivity correlation is
lacking. We found that the clustering function can be restored to that of the yeast dataset by
adding to the current dataset interlogs selected randomly among the candidates. We also
performed motif analysis for the three species, finding that the density of the triangle motif
is increased by the addition of the interlogs to the current fly dataset. Finally, the candidate
protein interactions of the fly are provided in the supplementary materials, which could be
useful in finding protein interactions missed in the current fly dataset.

This work is supported by the KOSEF grant No. R14-2002-059-01000-0 in the ABRL program and the
MOST grant No. Ml 03B500000110.

Fig. 6: Effect of bait selection. Red circles are for the full data, green diamonds for the
randomly sampled data, and blue squares for the biased sampled data.